Evaluating Psychological Research Requires More Than Attention to the N.
Abstract
Simonsohn (2015) proposed to use effect sizes of high-powered replications to evaluate whether lower-powered original studies could have obtained the reported effect. His focus on sample size misses that effect-size comparisons are informative with regard to a theoretical question only when the replications (i) successfully realize the theoretical variable of interest, which (ii) usually requires supporting evidence from a manipulation check that should (iii) also indicate that the manipulations were of comparable strength. Because psychological phenomena are context sensitive, (iv) the context of data collection should be similar and (v) the measurement procedures comparable across studies. (vi) Larger samples are often more diverse in terms of demographics and individual differences, which can further affect effect-size estimates. Without attention to these points, high-powered replications do not allow inferences about whether lower-powered original studies could observe what they reported.

Replications are often considered more valid than the original study when they have a larger N. Going beyond this assumption, Simonsohn (2015) proposed to use effect-size estimates from high-powered replications to determine whether lower-powered original studies could have found what they reported: Is the phenomenon seen with the "big telescope" of a large-N replication large enough to have been visible with the "small telescope" of the lower-N original study? His conceptual and methodological errors illustrate the pitfalls of a purely statistical focus.

Concepts and Manipulations

Psychologists conduct experiments to test theories. Just as original studies, replications need to ensure that the theoretically specified variables are realized. Testing feelings-as-information theory, Schwarz and Clore (1983, Experiment 2) used the first sunny days of spring after a long Midwestern winter and the inevitable return of cold, rainy weather as naturalistic mood manipulations.
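The "small telescope" criterion asks whether the original study, given its sample size, had adequate power (Simonsohn uses a 33% threshold) to detect the effect size estimated by a large-N replication. A minimal sketch of that logic for a two-group comparison, using a normal approximation to the two-sample t-test; the numbers are illustrative, not taken from the studies discussed here:

```python
# Sketch of the "small telescope" logic: did the original study have at
# least 33% power to detect the effect size estimated by a large-N
# replication? Normal approximation; numbers below are hypothetical.
from statistics import NormalDist

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test of effect size d."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Noncentrality for equal groups: d * sqrt(n1 * n2 / (n1 + n2)).
    ncp = d * (n_per_group / 2) ** 0.5
    return z.cdf(ncp - z_crit) + z.cdf(-ncp - z_crit)

replication_d = 0.15   # effect size estimated by the replication (hypothetical)
original_n = 20        # per-group N of the original study (hypothetical)

power = approx_power(replication_d, original_n)
print(f"Power to detect d={replication_d} with n={original_n}/group: {power:.2f}")
# Under the 33% criterion, power < 1/3 means the replication effect was
# "too small" for the original study's telescope to have seen.
print("Detectable by the 'small telescope'?", power >= 1 / 3)
```

Note that this sketch captures only the statistical half of the argument; the sections below concern the conditions under which such a comparison is theoretically meaningful at all.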
A mood measure confirmed more positive moods during the former than the latter days. As predicted, participants evaluated their lives as a whole more favorably when in a good rather than a bad mood. This difference was eliminated when their attention was drawn to the weather, leading them to realize that their current feelings may not be indicative of their general quality of life. A laboratory experiment with different manipulations replicated the mood × attribution interaction on judgments of life-satisfaction. Other work extended the theoretical rationale to the informational value of other subjective experiences, including arousal, emotions, bodily sensations, and the fluency of mental procedures (for reviews, see Schwarz & Clore, 2003, 2007).

Consistent with experimental conventions of the 1980s, Schwarz and Clore's experiment had a small N and was statistically underpowered. Simonsohn's Figure 2 compares its effect size with effect sizes from two large panel surveys that assessed the covariation of sunshine on the day of the interview and respondents' reports of life-satisfaction. Feddersen, Metcalfe, and Wooden (2012) found a small influence of the weather in Australia, whereas Lucas and Lawless (2013) found none in the United States. Neither of these data sets contained a mood measure, which renders them silent on whether mood influences judgments of life-satisfaction.

Simonsohn's (2015) decision to equate a conceptual variable (mood) with its manipulation (weather) is compatible with the logic of clinical trials, but not with the logic of theory testing. In clinical trials, which have inspired much of the replicability debate and its statistical focus, the operationalization (e.g., 10 mg of a drug) is itself the variable of interest; in theory testing, any given operationalization is merely one, usually imperfect, way to realize the conceptual variable.
For this reason, theory tests are more compelling when the results of different operationalizations converge (Stroebe & Strack, 2014), thus ensuring that it is not "the weather" but indeed participants' (sometimes weather-induced) mood that drives the observed effect. Informative theory tests therefore require evidence that the manipulation realized the conceptual variable. Such evidence is provided by measures that assess the conceptual variable, serving as manipulation checks. Put simply, if you don't know what the mood was, you can't make inferences about the influence of mood.

Comparability and Strength of Manipulations

The size of experimental effects depends, in part, on the strength of the manipulation. Even if a manipulation successfully induced a positive mood, its observed impact will vary with the intensity of the mood. Schwarz and Clore took advantage of the upbeat affect associated with the arrival of spring in the Midwestern United States and the dread associated with a temporary return of winter. In Simonsohn's comparisons, this turns into variations in sunshine and cloud cover per se, independent of season and location. But a sunny summer day in Texas is not the psychological equivalent of a sunny spring day in the Midwest, which renders the data silent on even the most atheoretical variant of the research question: Do similar (!) weather conditions reproduce the original effect?

The comparability and strength of experimental manipulations is more often assumed than assessed. Indeed, what qualifies as sufficiently "similar" is often theoretically underspecified. Many psychological theories address how one variable (e.g., mood, motivation, attitude strength) influences another one (e.g., judgment, choice) without fully specifying the determinants of the independent variable itself. Theories of mood and judgment, for example, are silent on what gives rise to a mood in the first place.
Hence, the implementation of independent variables is frequently based on a mix of earlier results and personal intuition, further highlighting the need for sensible manipulation checks and converging evidence across different manipulations.

Empirically, similarity of the procedures used in the replication and the original study is a major predictor of (non)replication. In the Open Science Collaboration's (2015) reproducibility project, 11 replications used procedures that the original authors considered inappropriate prior to data collection; 10 of them failed (Open Science Center, 2016). Because the context sensitivity of human cognition and the dynamics of social and cultural change apply to research materials as they apply to other things psychologists study, even technically identical manipulations do not guarantee an equivalent test of the psychological phenomenon when the context changes (for extended discussions, see Fabrigar & Wegener, 2015; Schwarz & Strack, 2014). What is or is not a meaningful change in context is often controversial, as the recent discussion about the fidelity of replications in the reproducibility project illustrates (Gilbert, King, Pettigrew, & Wilson, 2016; Open Science Collaboration, 2016). Nevertheless, manipulation checks that might settle the issue are routinely missing in high-profile replication efforts. Empirically, the context sensitivity of a phenomenon, rated by experts who are unaware of replication results, predicts its replication likelihood (Van Bavel, Mende-Siedlecki, Brady, & Reinero, 2016): the less context sensitive the phenomenon, the more likely it is to replicate in another lab.

Comparability of Measurement Procedures

The size of an observed effect further varies with the level of noise in its measurement. Accordingly, effect-size comparisons need to attend to the comparability of the measurement procedures, which often requires attention to the psychology of self-report (Schwarz, 1999).
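The point about measurement noise can be made concrete with the classical attenuation formula from psychometrics: under classical measurement error, an observed correlation shrinks with the reliability of each measure, so the same true effect yields different observed effect sizes across studies with differently noisy instruments. A toy illustration (the numbers are hypothetical, not drawn from the studies discussed):

```python
# Classical attenuation: with measure reliabilities rho_x and rho_y,
# observed r = true r * sqrt(rho_x * rho_y). Illustrative numbers only.
def attenuated_r(true_r, reliability_x, reliability_y):
    """Observed correlation under classical measurement error."""
    return true_r * (reliability_x * reliability_y) ** 0.5

true_r = 0.50
print(f"{attenuated_r(true_r, 1.0, 1.0):.3f}")  # noise-free measures
print(f"{attenuated_r(true_r, 0.8, 0.8):.3f}")  # both measures reliability .8
print(f"{attenuated_r(true_r, 0.8, 0.5):.3f}")  # one much noisier measure
```

The same true relation thus looks substantially weaker when either measure is noisier, which is why comparing effect sizes across studies presupposes comparable measurement procedures.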
Theories of judgment assume that the impact of a given input decreases with the number of other inputs considered in forming the judgment (Bless, Schwarz, & Wänke, 2003). For example, life-satisfaction and marital satisfaction correlate r = .32 when the questions are asked in the life-marriage order, but r = .67 when asked in the marriage-life order, reflecting that a given input has more impact when it has just been brought to mind. This impact decreases when additional relevant inputs are rendered accessible, e.g., from r = .67 to r = .46 when work-satisfaction and leisure-satisfaction questions precede the marriage and life questions (Schwarz, Strack, & Mai, 1991). Thus, identical manipulations result in smaller effects when the item of interest is preceded by other items that broaden the range of accessible inputs relevant to the judgment.

In the surveys on which Simonsohn draws, life-satisfaction was preceded by numerous other questions in interviews exceeding 80 minutes, bringing many other applicable inputs to mind. Moreover, the surveys used demographically diverse samples and spanned multiple years. In contrast, life-satisfaction and happiness were the first questions in Schwarz and Clore's experiment, conducted with a homogeneous student sample on the same campus during 4 spring days in 1981.

Such methodological variables affect variation in the data set and hence the observed effect size of any manipulation. They nevertheless receive little attention in prominent replication projects, which include many experiments in a single data collection. For example, the replication projects of the Open Science Center included 13 experiments in one 20-minute session for "Many Labs 1" (Klein et al., 2014), up to 15 experiments in 30 minutes for "Many Labs 2" (Klein et al., 2015), and 10 experiments in 30 minutes for "Many Labs 3" (Ebersole et al., 2015). Few, if any, of the original studies were conducted in such a format.
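The intuition behind these question-order effects can be caricatured in a few lines. If a judgment is treated as a simple average of whatever inputs are accessible at the time, the marginal impact of any one input (e.g., current mood) shrinks mechanically as preceding questions bring more inputs to mind. This is a deliberately crude toy model, not the cited authors' formalism:

```python
# Toy averaging model of judgment (illustrative only): the judgment is
# the mean of the accessible inputs, so the shift produced by the mood
# input falls as more other inputs become accessible.
def judgment(inputs):
    return sum(inputs) / len(inputs)

def mood_impact(other_inputs, good_mood=1.0, bad_mood=-1.0):
    """Shift in judgment produced by mood, given other accessible inputs."""
    return judgment([good_mood] + other_inputs) - judgment([bad_mood] + other_inputs)

print(mood_impact([]))               # mood is the only input: impact 2.0
print(mood_impact([0.0]))            # one other input: impact halves to 1.0
print(mood_impact([0.0, 0.0, 0.0]))  # three other inputs: impact 0.5
```

Even in this caricature, a weather-induced mood placed first in a short questionnaire and the same mood buried in an 80-minute interview yield very different observed effects, without any change in the underlying mood-judgment link.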
Journal: Psychological science
Volume 27, Issue 10
Publication date: 2016